IAnimal: a cross-species omics knowledgebase for animals

Abstract With the exponential growth of multi-omics data, its integration and utilization have brought unprecedented opportunities for the interpretation of gene regulation mechanisms and the comprehensive analyses of biological systems. IAnimal (https://ianimal.pro/), a cross-species, multi-omics knowledgebase, was developed to improve the utilization of massive public data and simplify the integration of multi-omics information to mine the genetic mechanisms of objective traits. Currently, IAnimal provides 61 191 individual omics data of genome (WGS), transcriptome (RNA-Seq), epigenome (ChIP-Seq, ATAC-Seq) and genome annotation information for 21 species, such as mice, pigs, cattle, chickens, and macaques. The scale of its total clean data has reached 846.46 TB. To better understand the biological significance of omics information, a deep learning model for IAnimal was built based on BioBERT and AutoNER to mine ‘gene’ and ‘trait’ entities from 2 794 237 abstracts, which has practical significance for comprehending how each omics layer regulates genes to affect traits. By means of user-friendly web interfaces, flexible data application programming interfaces, and abundant functional modules, IAnimal enables users to easily query, mine, and visualize characteristics in various omics, and to infer how genes play biological roles under the influence of various omics layers.


INTRODUCTION
With the rapid development of high-throughput sequencing technology, the quantity of data in omics layers has increased dramatically. The integrative analysis of multi-omics data has brought unprecedented opportunities for the interpretation of gene regulation mechanisms and the comprehensive analysis of biological systems (1). For example, the Encyclopedia of DNA Elements (ENCODE) project aims to precisely and comprehensively delineate the segments that encode functional elements in the human and mouse genomes using large amounts of multi-omics data, which include genome, transcriptome and epigenome data; 926 535 human candidate cis-regulatory elements (cCREs) and 339 815 mouse cCREs have been identified so far (2). The Functional Annotation of ANimal Genomes (FAANG) project is working to decipher the function of genome segments with multi-omics data, and to date it has completed the analysis of 14 animals, including pigs, cattle, and salmon (3). However, several key challenges have emerged in the development and utilization of multi-omics data. First, the variety of complicated data sources and the different descriptive standards of data notably increase the difficulty of data collection and cleaning. Second, the huge amount of omics data requires efficient methods of data analysis, storage, and retrieval. Finally, intelligent methods need to be developed to integrate, mine, and interpret various types of omics data.
Compared with model animals, like mice, multi-omics integration research progress for livestock animals (e.g. pigs), companion animals (e.g. cats), and wild animals (e.g. pandas) lags far behind. One of the main reasons is that the data volume of these animal species is relatively small, at ∼0.2-5% that of mice (Supplementary Figure S1). Additionally, the publicly available omics data for these animal species are not well standardized because the data come from different projects, and there is a lack of unified methods for systematic collation of basic sample information, quality control, and analysis, which makes these data difficult to reuse.
There is evidence that reusing publicly available omics data facilitates new biological discoveries. For example, in our previous study (4), almost all publicly available microRNA (miRNA) data for pigs were collected, cleaned, and analyzed, which tripled the number of annotated pig miRNAs and improved the integrity of annotation information for half of the known miRNAs. Therefore, many studies have tried to clean and normalize the highly heterogeneous omics data of animals. For example, Genome Variation Map (GVM) mainly focuses on genome variation (5), Ruminant Genome Database (RGD) focuses on ruminant gene functional research (6), and The Animal QTL Database (AnimalQTLdb) provides abundant quantitative trait loci (QTL) information of animals (7). However, existing databases focus mainly on a single type of omics data, whereas multi-omics data that include DNA, RNA and proteins are necessary to reveal causal relationships between genes and traits from a holistic perspective of biological systems. Meanwhile, applying massive multi-omics data of model animals like mice to build an integrative multi-omics model that can be adapted to other animal species is expected to break the bottleneck caused by insufficient data. The multi-omics data of multiple species can in turn be used to refine the study of gene function in model animals. Therefore, it is important to develop a platform for the comprehensive collection of multi-omics data on various animal species and the facilitation of cross-species omics research.
At present, most animal omics databases follow a strategy wherein omics data are analyzed in advance, and fixed conclusions are provided to users. This strategy is very effective for solving specific problems, but it sacrifices the flexibility and reusability of data, and it indirectly wastes computing resources and time. For example, in Animal-eRNAdb (8), enhancer RNA (eRNA) expression level can be queried easily in all available individuals for a given species, but the data need to be re-downloaded, cleaned, and analyzed if the expression level in a particular subset of individuals needs to be fetched. Similarly, GVM (5) allows users to query genotype information easily on any particular marker site of all available individuals in a given species, but it does not support queries or computational operations at the individual level. More flexible databases are required to offer omics data processing at the individual level, which can significantly promote the reusability and mining efficiency of omics data.
The integrated analysis of multi-omics data is a difficult problem to solve. The commonly used method is data stacking, which is relatively simple to implement but has high false positive and false negative rates. These rates can be reduced effectively by designing appropriate statistical models for specific data sets, although such models are limited to the specific scenarios and experiments for which they were designed. With the arrival of the third development wave of artificial intelligence, deep learning has become one of the most promising research methods for multi-omics integration due to its good compatibility with heterogeneous data and its powerful big data processing capabilities (9). In one study, a convolutional neural network (CNN) model was utilized to integrate the information of genome, transcriptome, and quantitative trait loci/gene/nucleotide (QTX) in pigs and to provide a score to assess the causal relationship of each 'gene-trait' pair (10). Compared with single omics, the CNN model trained by multi-omics data improved the mining efficiency of key genes underlying specific traits, but the limited multi-omics data for pigs also posed an obstacle to the further improvement of scoring accuracy. Making full use of massive, cross-species multi-omics data through transfer learning is expected to solve this problem in some animal species with insufficient multi-omics data. In addition, the rapid development of natural language processing technology makes it possible to efficiently mine gene-trait relationships in a large quantity of literature, thus helping to predict gene functions and to improve the interpretability of results from integrated analysis of multi-omics data.
In this study, we constructed IAnimal, which is an individual-level, cross-species, multi-omics knowledgebase. It includes individual level omics data for genome (WGS, whole genome sequencing), transcriptome (RNA-Seq, RNA sequencing), and epigenome (ChIP-Seq, chromatin immunoprecipitation with high-throughput sequencing, and ATAC-Seq, assay for transposase-accessible chromatin with high-throughput sequencing) data for 21 animal species, including mice, pigs, cattle, chickens and macaques. In addition, IAnimal also contains a large quantity of literature abstracts to reveal how each omics affects the traits through genes. Unified standards were used to clean, analyze, and structure these omics data based on engineering approaches and crowdsourcing ideas. Data-application programming interfaces (APIs) were also developed at the individual level to settle upon a convenient approach for the use of structured data.

Data collection
Genome data, high-throughput omics data, and information extracted from the literature of 21 animal species (including mice, pigs, cattle, chickens and macaques) were collected to construct a cross-species, multi-omics knowledgebase. Because the quantity of data for mice far exceeds that of other species, a certain number of representative samples were selected by excluding highly similar ones. In contrast, data for other species were collected as comprehensively as possible. Genome sequences and annotations of all species were obtained from the Ensembl database (11), high-throughput sequencing reads were downloaded from the SRA (12) and EBI (13) databases, and literature abstracts were acquired from the NCBI database (14) through the Entrez interface. After quality control on these data, the final information used in IAnimal included 2 794 237 literature abstracts and 61 191 individual-level omics data from WGS, RNA-Seq, ChIP-Seq and ATAC-Seq and genome annotation information for 21 species. The scale of clean data was approximately 846.46 TB (Table 1).
Efficiently collecting, cleaning, analyzing and storing large omics data sets that come from widely distributed sources, in different data formats and of uneven quality, is always a great challenge. Considering the various characteristics of omics data, unified standards and platforms were designed in this study. First, an automatic download, analysis, and storage system for omics data was developed using technologies such as Docker, Nextflow (15) and PostgreSQL. Then, for high-throughput data that needed manual cleaning, an NGS cleaning program (Supplementary Figure S2) based on the idea of 'crowdsourcing' was established, in which volunteers viewed and processed the data simultaneously, and potential errors were corrected by mutual verification. Meanwhile, the Label Studio (16) platform (https://github.com/heartexlabs/label-studio) was employed to conduct online labeling for the literature.

Gene family and core gene set analysis
To facilitate the cross-species comparison of genes, only the longest coding transcript of each gene was retained among its isoforms, and OrthoFinder (V2.5.4) (25) was applied to group the genes into 30 206 clusters. Consistent with a previous study (26), we also defined core gene sets that are common to all species at a certain phylogenetic level and potentially dispensable gene sets that show presence/absence variations across species at the same phylogenetic level. According to the distribution of genes for each species in these clusters, the core and dispensable gene families were counted in different evolutionary branches, and core genes of different species were identified simultaneously at different phylogenetic levels, which included phylum, class, order, family, genus, and species (Supplementary Figure S3B).
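The core/dispensable split described above can be illustrated with a minimal sketch. It assumes a presence/absence table that maps each gene family to the set of species in which it occurs; the function name, orthogroup identifiers, and species labels are hypothetical and are not taken from IAnimal.

```python
from typing import Dict, Set, Tuple


def split_core_dispensable(
    family_to_species: Dict[str, Set[str]],
    clade_species: Set[str],
) -> Tuple[Set[str], Set[str]]:
    """Classify gene families within one clade (one phylogenetic level).

    core:        present in every species of the clade
    dispensable: present in at least one, but not all, species of the clade
    Families absent from the whole clade are ignored.
    """
    if not clade_species:
        raise ValueError("clade_species must not be empty")
    core: Set[str] = set()
    dispensable: Set[str] = set()
    for family, species in family_to_species.items():
        present = species & clade_species
        if present == clade_species:
            core.add(family)
        elif present:
            dispensable.add(family)
    return core, dispensable


# Toy presence/absence table (hypothetical orthogroups and species).
families = {
    "OG0001": {"pig", "cattle", "mouse"},
    "OG0002": {"pig", "cattle"},
    "OG0003": {"mouse"},
}
core_small, disp_small = split_core_dispensable(families, {"pig", "cattle"})
core_all, disp_all = split_core_dispensable(families, {"pig", "cattle", "mouse"})
```

Running the same function with a wider or narrower `clade_species` set corresponds to moving up or down the phylogenetic levels mentioned in the text.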

Processing of RNA-seq data
Like WGS data, all collected RNA-seq datasets were also processed through a standard bioinformatics pipeline. After conversion and trimming, the remaining high-quality reads were aligned against the reference sequence by HISAT2 (V2.2.1) (35), and then alignments were fed to StringTie (V2.1.7) (36) to assemble the transcripts and to quantify the expression levels of all genes. To ensure the accuracy of quantification, only samples with >6 million aligned reads were retained. To prevent interference from abnormal samples, the median value was applied to represent the gene expression in the heatmap, and outliers were deleted by using the method of Tukey's fences in the boxplot. Under this method, observations outside the following range are treated as outliers:
$$[\,Q_1 - k(Q_3 - Q_1),\; Q_3 + k(Q_3 - Q_1)\,]$$
where Q1 and Q3 represent the first and third quartiles of Euclidean distance observations, respectively, and k is a nonnegative constant, where k = 1.5 or k = 3 indicates an 'outlier,' and k was set to 3 in this study. Finally, the Pearson correlation coefficient between two genes, calculated without considering tissue type, breed, or developmental stage in a species, was defined as the gene co-expression coefficient (GCC).
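As an illustration of the two statistics named above, the following sketch computes Tukey's fences with k = 3 and a Pearson correlation between two expression vectors. It is not the IAnimal implementation: the function names are hypothetical, the quartile convention (Python's 'inclusive' method) is an assumption, and the fences are applied to a plain list of values rather than to the Euclidean distance observations used in the study.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def tukey_fences(values: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (Q1 - k*IQR, Q3 + k*IQR), where IQR = Q3 - Q1."""
    # Assumption: 'inclusive' quartiles; other conventions shift the fences slightly.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values: Sequence[float], k: float = 3.0) -> List[float]:
    """Keep only the observations that lie within Tukey's fences."""
    low, high = tukey_fences(values, k)
    return [v for v in values if low <= v <= high]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long vectors (a GCC-style value)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


# Toy data: one extreme observation among otherwise regular values.
observations = [1, 2, 3, 4, 5, 6, 7, 8, 100]
fences = tukey_fences(observations, k=3.0)
cleaned = drop_outliers(observations, k=3.0)
gcc_example = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

With k = 3 only observations far from the bulk of the data are removed, which is a more conservative choice than the commonly used k = 1.5.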

Processing of ChIP-seq and ATAC-seq data
All collected ChIP-Seq and ATAC-Seq datasets were first required to pass conversion and quality control with both fastp (V0.12.4) (28) and Chromap (V0. [...] the comparison of enrichment signals in the specified region of different samples, the genome was divided into bins with a length of 200 bp in which the enrichment signals were counted by bigWigAverageOverBed (V2.0) (40). It is worth noting that because the amount of ChIP-Seq and ATAC-Seq data for the vast majority of species was much smaller than that of RNA-Seq and WGS data, a relatively loose filtering criterion was established, namely that only samples with <2000 peaks were removed. Users can flexibly select samples of interest for mining and visualization through the interface provided by IAnimal.
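The 200 bp binning and the peak-count filter mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy chromosome size, and the sample identifiers are hypothetical, and intervals are written as 0-based, half-open coordinates in the style of BED files.

```python
from typing import Dict, Iterator, List, Tuple

BIN_SIZE = 200    # bin length in bp, as stated in the text
MIN_PEAKS = 2000  # samples with fewer peaks than this are removed


def genome_bins(
    chrom_sizes: Dict[str, int], bin_size: int = BIN_SIZE
) -> Iterator[Tuple[str, int, int]]:
    """Yield (chrom, start, end) bins; the last bin of a chromosome may be shorter."""
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, bin_size):
            yield chrom, start, min(start + bin_size, size)


def keep_samples(peak_counts: Dict[str, int], min_peaks: int = MIN_PEAKS) -> List[str]:
    """Return the sample IDs that pass the loose peak-count filter."""
    return [sample for sample, n_peaks in peak_counts.items() if n_peaks >= min_peaks]


# Toy input: a 450 bp chromosome and three hypothetical samples.
bins = list(genome_bins({"chr1": 450}))
kept = keep_samples({"sampleA": 1999, "sampleB": 2000, "sampleC": 50000})
```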

Processing of literature data
The BioBERT (41) and AutoNER (42) [...] (43) and Vertebrate Trait Ontology (44). Based on these two models, the gene and phenotype entities were identified in all literature abstracts, and the union of the two result sets was taken. To make it convenient to explore the relationships between genes and traits, gene entities were mapped to both gene ID and gene name, and only the sentences that contained both genes and traits were kept for query, feedback, and visualization.
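The last two steps above (taking the union of the entities found by the two models, then keeping only sentences with both a gene and a trait) can be sketched as below. The model predictions are represented by hand-written toy sets; the `Prediction` class and the function names are hypothetical, and no BioBERT or AutoNER call is made.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Prediction:
    """Entities that one model found in one sentence."""
    genes: Set[str] = field(default_factory=set)
    traits: Set[str] = field(default_factory=set)


def merge(a: Prediction, b: Prediction) -> Prediction:
    """Union of the entities predicted by two models for the same sentence."""
    return Prediction(a.genes | b.genes, a.traits | b.traits)


def keep_gene_trait_sentences(
    sentences: List[str],
    preds_a: List[Prediction],
    preds_b: List[Prediction],
) -> List[Tuple[str, Prediction]]:
    """Keep sentences whose merged prediction has at least one gene and one trait."""
    kept = []
    for text, a, b in zip(sentences, preds_a, preds_b):
        merged = merge(a, b)
        if merged.genes and merged.traits:
            kept.append((text, merged))
    return kept


# Toy input: the first sentence has a gene and a trait, the second only a gene.
sentences = ["IGF2 is associated with muscle growth.", "KIT was sequenced."]
model_a = [Prediction(genes={"IGF2"}), Prediction(genes={"KIT"})]
model_b = [Prediction(traits={"muscle growth"}), Prediction()]
kept = keep_gene_trait_sentences(sentences, model_a, model_b)
```

Taking the union favors recall over precision, which matches the later remark that false positives cannot be avoided completely and that user feedback is collected.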

SYSTEM DESIGN AND IMPLEMENTATION
IAnimal is a decoupled system primarily based on the Vue front-end and SpringBoot back-end frameworks. To facilitate the storage and invocation of big omics data, we [...] IAnimal is freely available to the public, accessible on both computers and mobile devices without login or registration, and it has been optimized for multiple browsers, including Chrome (recommended), Internet Explorer, Opera, Firefox, Microsoft Edge and Safari.

Overview of IAnimal
IAnimal is committed to helping users excavate gene functions by using big, cross-species, multi-omics data, which can make full use of massive public data and, simultaneously, reduce the energy consumption caused by highly repetitive calculations. Based on engineering and crowdsourcing concepts, IAnimal completes data collection and analysis efficiently (Figure 1A, B), develops flexible data APIs to facilitate data invocation and excavation, and provides user-friendly functional modules to make the knowledgebase easy to use (Figure 1C). The current implementation of the IAnimal knowledgebase contains 25 modules in five core sections (Genome, Transcriptome, Epigenome, Literature and Tools) and three additional auxiliary sections (Taxonomy, Download and Help). The core sections are mainly developed for the purpose of convenient data query, excavation, and visualization, and the auxiliary sections help users obtain additional information and documents provided by the knowledgebase. Users can browse and preview the functions of candidate genes rapidly through the gene search module located on the homepage. This module integrates various omics information of genes to help users explore their potential functions; users can then jump to the relevant omics section to explore the functions of each gene at a specific omics level and, finally, a series of relevant toolsets can be applied to the downstream excavating analysis of gene functions.
D1316 Nucleic Acids Research, 2023, Vol. 51, Database issue

Gene Search module with integrated multi-omics information
A quick way to utilize multi-omics information is to search the genes of interest through the Gene Search module located on the homepage, which supports searching by gene name, gene ID, genomic region or functional annotation (Figure 2A). Through the advanced search function, users can perform more flexible gene searches, which include batch search and screening of large-scale genes. When there are many search results, users can filter the search results by gene expression level in the specified tissue, the type of mutation contained in the gene, or the gene function given by the literature group (Figure 2B). The results integrate 'basic information', 'sequence', 'structure' (Figure 2D), 'functional annotation', 'expression levels' (Figure 2E), 'variant', 'literature entities', 'homologous genes', 'peak signal' and 'gene network' for all genes, and users can infer the potential biological functions of the genes quickly from this information (Figure 2C). Here, the omics information of genes is integrated mainly by using the default parameters, and users can further explore the functions of candidate genes through specific modules and data APIs in IAnimal.

Genome section
The Genome section contains six modules: Gene Annotation, Gene Family, Core GeneSet, Genome Information, Variation and Population. The Gene Annotation module is used mainly to help users query the annotation of a specified gene in databases such as Swiss-Prot, KEGG, GO, Pfam and InterPro. The Gene Family module is designed to query genes and gene families, to explore gene functions from the gene family level, and then to realize the comparison of gene functions within and between species (Supplementary Figure S3A). The Core GeneSet module provides conserved/dispensable gene families in different evolutionary branches and conserved genes of all species at different phylogenetic levels (e.g. phylum/class/order/family/genus/species, Supplementary Figure S3B). Users can download relevant information by interacting with the visual images. The Genome Information module provides basic information on the genome in IAnimal, which is convenient for obtaining the same genome for downstream analysis. The Variation module is the most important function in this section. With the aid of this module, users can retrieve variant loci of interest by variant ID, gene ID/Name or genome region (Figure 3A), and users can also construct one or more subpopulations of interest through breed information or sample ID (Figure 3B). To make full use of individual information to construct subpopulations, this section also provides a Population module to help users understand the basic information and evolutionary relationships of samples (Figure 3C). The Variation module will calculate gene frequencies for all specified subpopulations, so users can quickly compare the similarities and differences of variant loci among these subpopulations (Figure 3D). Users can further filter the variant loci of interest based on the comparison results among these subpopulations and obtain detailed annotation information for these variant loci and their distribution in all samples (Figure 3E, F).
Furthermore, the genotype data of all individuals can be obtained through the download interface provided by this module to achieve more flexible downstream analysis and exploration (Figure 3E). In addition, to help users visualize the genotype of specified samples, IAnimal also provides the Genotype Plotter module based on our flexible data API (Figure 3H). Users only need to input the variant ID and sample ID of interest, and the module will output a high-quality genotype image, which can be used for publication directly.
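The subpopulation comparison in the Variation module can be illustrated with a small sketch that computes an alternate-allele frequency per user-defined subpopulation at one variant locus. The data layout is an assumption made for the example (diploid, biallelic genotypes coded as 0, 1 or 2 alternate alleles, with `None` for missing calls), and all names are hypothetical rather than part of the IAnimal API.

```python
from typing import Dict, Iterable, Optional


def alt_allele_frequency(
    genotypes: Dict[str, Optional[int]], samples: Iterable[str]
) -> Optional[float]:
    """Alternate-allele frequency among the given samples at one locus.

    Genotypes are alternate-allele counts (0, 1 or 2); missing calls (None)
    and unknown sample IDs are skipped. Returns None if no call is available.
    """
    calls = [genotypes[s] for s in samples if genotypes.get(s) is not None]
    if not calls:
        return None
    return sum(calls) / (2 * len(calls))


def compare_subpopulations(
    genotypes: Dict[str, Optional[int]], subpops: Dict[str, Iterable[str]]
) -> Dict[str, Optional[float]]:
    """Frequency of the alternate allele in each named subpopulation."""
    return {name: alt_allele_frequency(genotypes, ids) for name, ids in subpops.items()}


# Toy locus with five hypothetical samples split into two subpopulations.
locus = {"s1": 0, "s2": 1, "s3": 2, "s4": None, "s5": 2}
freqs = compare_subpopulations(locus, {"breedA": ["s1", "s2"], "breedB": ["s3", "s4", "s5"]})
```

A large frequency difference between two subpopulations at the same locus is the kind of contrast a user would then inspect in more detail.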

Transcriptome section
The Transcriptome section contains three modules: Gene Expression, Gene Network, and GCC Comparison. Users can retrieve the expression level of the gene of interest in different samples through gene ID/Name or genome region, and batch search is also available for multiple genes (Figure 4A). Since the sample size of the transcriptome is generally large, users often expect to compare the expression levels of genes across several specific subgroups. Therefore, this module provides two modes (custom grouping and quick grouping by tissue) to help users generate subgroups of interest rapidly (Figure 4B). Finally, the expression levels of genes in each subgroup are displayed in a heatmap (Figure 4C), and users can select the genes and subgroups of interest from the heatmap to be displayed in a boxplot for comparison (Figure 4D).
The Gene Network module in this section can also construct a GCC matrix for all genes. Users can obtain and visualize the gene set (target genes) related to the specified gene (query gene) and indirect genes related to the target genes (Figure 4E) through the gene ID and the GCC threshold (the default setting is that the absolute value of the GCC is >0.5). By default, only the top 10 genes ranked by the absolute value of GCC are displayed, and the user can increase or decrease the number of genes to be displayed by changing the corresponding parameters. To compare the differences in the regulation patterns of genes in different species (Figure 4F), this section also provides a GCC Comparison module to obtain the GCC of a specified gene set in two different species. Users only need to select two species and enter a gene set to visually compare the GCC among the gene sets between the two species.
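The default selection rule of the Gene Network module (absolute GCC above 0.5, top 10 genes by absolute GCC) can be sketched as follows. The function name and the toy GCC values are hypothetical; the sketch covers only the selection of direct target genes for one query gene, not the indirect genes or the visualization.

```python
from typing import Dict, List, Tuple


def top_coexpressed(
    gcc_row: Dict[str, float], threshold: float = 0.5, top_n: int = 10
) -> List[Tuple[str, float]]:
    """Select target genes for one query gene.

    gcc_row maps each candidate gene to its GCC with the query gene.
    Genes are kept if |GCC| > threshold, ranked by |GCC| in descending order,
    and truncated to the top_n strongest partners.
    """
    hits = [(gene, r) for gene, r in gcc_row.items() if abs(r) > threshold]
    hits.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return hits[:top_n]


# Toy GCC values between a query gene and five hypothetical genes.
row = {"geneA": 0.90, "geneB": -0.95, "geneC": 0.50, "geneD": 0.30, "geneE": -0.60}
default_hits = top_coexpressed(row)
top_two = top_coexpressed(row, top_n=2)
```

Ranking by absolute value keeps strongly negative correlations, which may be as informative as positive ones when exploring regulation.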

Epigenome section
The Epigenome section contains five modules: Signal View, Peak Search, Signal Plotter, Signal Comparison and Data Matrix. Using the Signal View module, the enrichment signals of specified regions in different targets and tissues can be obtained by searching gene ID or genomic region. The Signal View module provides two modes, selection by target/tissue and custom grouping, which help users construct any number of subgroups (Supplementary Figure S4), and the retrieved results are shown in a heatmap (Figure 5A). To make it easier for users to customize subgroups with sample information, this section also provides the Data Matrix module to help users view the epigenomics data in IAnimal more intuitively (Supplementary Figure S5). In addition to enrichment signals, users can also view enrichment peaks and their statistical information in a specified genome region through the Peak Search module (Figure 5B). By clicking the link in the results, the genome coverage of the sample corresponding to the peak can be conveniently viewed in the JBrowse module (Figure 5C). In addition, although the coverage of a specified region for the samples of different targets and tissues near a specified gene can be viewed through the JBrowse track file provided by IAnimal, it is difficult to merge and to visualize a large number of samples in JBrowse. We therefore implemented the Signal Plotter module by using IAnimal's flexible data API, which can merge samples in the specified group and return a publication-level vector diagram (Figure 5D), and users can specify one or more groups for visualization. IAnimal also provides a Signal Comparison module to easily reveal potential links between ChIP-seq, ATAC-seq and expression levels of given genes across species. Using this module, users can easily compare the signals and expression levels of a given gene between two species (Figure 5E).

Literature section
The Literature section includes two modules: Entity Search and Entity Cloud. Users can retrieve gene or phenotype entities in the Entity Search module, which will return detailed descriptions and abstract information related to the corresponding entities (Figure 6A); then, users can comprehensively evaluate the potential functions of the specified genes and the potential regulatory gene sets of the specified traits. Because these entities are derived from machine learning models, false positives cannot be avoided completely. This module also provides a convenient feedback function to optimize the model continuously to improve the accuracy of entity recognition (Figure 6B, C). To facilitate intuitive exploration of gene functions and trait-related genes, this section also provides the Entity Cloud module, which displays the search results as a word cloud so that the information provided by the literature is clear at a glance (Figure 6D, E).
Figure 6. (A) The results of the Entity Search module, using the IGF2 gene as an example. The results contain genes and phenotype entities, and users can click the corresponding sentence or abstract to view the detailed information. (B) The feedback interface for Gene-Trait relationships. This allows users to give feedback on the reliability of relationships between genes and traits. (C) The feedback interface for entity recognition. Users can feed back the accuracy of entity recognition, which will assist the continuous optimization of the named entity recognition model. (D) The word cloud image generated by the Entity Cloud module, using the IGF2 gene as an example. From the word cloud image, users can infer that IGF2 may be related to muscle growth. (E) The word cloud image generated by the Entity Cloud module, using coat color trait as an example. From the word cloud image, users can infer that the KIT gene plays an important role in regulating this trait.

Tools section
The Tools section contains five modules: JBrowse, BLAST, Primer, Enrichment and Data API. The JBrowse module enables users to visualize genomes, genes, variants, ChIP-Seq, and ATAC-Seq signals for 21 species at the genome-wide level and to export high-quality vector graphics. Through the BLAST module, users can align the specified nucleic acid sequences or protein sequences online to the genomic, CDS, cDNA, ncRNA, and protein sequences of specified species, which is convenient for sequence function research. With the Primer module, users can design primers for downstream experimental validation. The Enrichment module can be used to perform GO and KEGG functional enrichment analysis on a specified gene set. The Data API module is the basis for the efficient use of multi-omics big data in the IAnimal knowledgebase. The API helps users acquire multi-omics data more flexibly for personalized analysis and visualization; it provides 12 types of interfaces, namely, Species, Expression, Genes, Variation, Epigenome, Literature, Homology, Gene NetWork, Annotation, Gene Family, Statistics and Plotter. By referring to the demo, users can obtain the data of interest. However, in contrast to the simpler modules in the Tools section, the use of the Data API module requires certain programming skills and experience. In the future, easier, faster, and more convenient online tools will be built on these interfaces to meet the requirements of users worldwide.

Taxonomy, download, and help modules
The Taxonomy module mainly introduces the species in this study and their omics data, which is convenient for users to obtain the basic information for each species. The Download module was designed to obtain genome-related files and various omics information used in the knowledgebase for local excavation. The Help module contains the introduction, user manual, FAQs, and update&news for IAnimal, in which users can obtain detailed information about the database and provide valuable comments and constructive suggestions.

SUMMARY AND FUTURE DIRECTIONS
With the continuous development of experimental techniques and sequencing technology, multi-omics data have exhibited hyper-exponential growth. However, it is still a major challenge to unite and utilize these very large data sets to systematically explore the genetic mechanisms that underlie the formation of a trait, especially in the domain of animal studies. Most existing animal databases, such as AnimalTFDB (48), AnimalQTLdb (7), Animal-ImputeDB (49) and Animal-eRNAdb (8), focus mainly on a single type of omics data. In this area of research, IAnimal is currently the most comprehensive multi-omics database, covering the largest number of animal species. At present, IAnimal includes 61 191 individual level omics data (e.g. WGS, RNA-Seq, ChIP-Seq and ATAC-Seq) and genome annotation information for 21 animal species, and its scale of clean data is 846.46 TB. IAnimal includes a novel deep learning model developed based on the BioBERT and AutoNER algorithms. This model mines the relationship between 'gene' and 'trait' by using 2 794 237 abstracts to learn the regulation pattern of different omics layers and effects of genes on traits.
By means of a user-friendly web interface, IAnimal enables users to easily query, mine, and visualize the features of genes in various omics, such as gene expression profiles in different tissues, gene networks among genes, genotyping results of variant sites, and enrichment signals around genes for different transcription factors or histones. With the aid of flexible data APIs and abundant functional modules within IAnimal, users can utilize cross-species multi-omics information to mine for gene functions. With the explosive increase in the scale of multi-omics data for animals and the rapid development of deep learning frameworks such as Transformer, developing more intelligent integrated multi-omics analysis methods to interpret the relationships between genes and traits will be a direction for future work.
It should be noted that IAnimal focuses mainly on WGS, RNA-Seq, ChIP-Seq, ATAC-Seq and literature data. In the future, with the increasing data volume of high-throughput/resolution chromosome conformation capture (Hi-C), whole genome bisulfite sequencing (WGBS), and other omics data types, we will continue to expand omics data and enrich IAnimal with new types of omics data. In addition, although flexible data APIs in IAnimal enable personalized data analysis, modules to facilitate downstream data analysis and visualization based on these APIs still need to be enriched. Overall, IAnimal will be committed to providing comprehensive, structured multi-omics data for a wide range of animal species as well as relevant, intelligent integration analysis algorithms and corresponding mining and visualization tools. IAnimal is a valuable resource for producing unprecedented knowledge to fill the gap between genomes and phenomes.

DATA AVAILABILITY
IAnimal is freely available to the public at https://ianimal.pro/.